The Annals of Applied Statistics — Latest Matching Preprints

1

The ATLAS Penalty: Auxiliary-Transformed Location-Aware Smoothing with Applications to Spatial Transcriptomics

Tang, Q.; Chi, E. C.; Wang, W.

2026-05-20 bioinformatics 10.64898/2026.05.18.725545 medRxiv

Top 0.1%

12.8%

We address the problem of fitting a collection of location-specific models under a spatial smoothness assumption. Existing approaches penalize roughness in the model parameters directly, an assumption that breaks down when smoothness is a function of parameters and auxiliary covariates rather than the parameters themselves. Our framework, the Auxiliary-Transformed Location-Aware Smoothing (ATLAS) penalty, generalizes spatial smoothness by penalizing roughness in transformations of model parameters using auxiliary information. As a concrete case study, we develop a spatially smooth deconvolution model for spatial transcriptomics that estimates tumor mixing coefficients from thousands of spots distributed on a single tissue slide. To handle the computational challenges posed by the nonlinear likelihood, nonsmooth nonconvex penalty, and spatially coupled estimation, we propose an alternating direction method of multipliers (ADMM) algorithm. Through simulation studies, we demonstrate that our framework provides substantially better spatial domain detection than approaches that smooth model parameters directly, with particularly strong gains when auxiliary covariates carry calibrated spatial structure.

2

Inference of enhancer-specific transcription factor interactions from gene expression data using a biophysical model

Safaeesirat, A.; Taeb, H.; Emberly, E.

2026-06-08 biophysics 10.64898/2026.06.03.729923 medRxiv

Top 0.1%

6.7%

Transcription factors (TFs) play a central role in gene expression and regulation. In recent years, numerous experimental techniques have generated large-scale datasets, alongside computational methods aimed at inferring the role of TF-TF interactions in gene regulation. However, these approaches typically yield global interaction patterns across datasets, which may not accurately reflect local regulatory interactions at specific enhancers. Here, we model transcription using an Ising-type biophysical framework and introduce approximations based on its mean-field representation to infer TF-TF interactions at the level of individual enhancers from expression data, such as STARR-seq or fluorescent protein measurements. We validate our approach using simulated data and evaluate the effect of the strengths of TF-TF and TF-DNA interactions on inference accuracy. We then apply the model to experimental fluorescence data of gap genes for the eve stripe-2 (eve2) enhancer in the fruit fly embryo. The model successfully infers the established roles of the gap genes and predicts the possibility of cooperative and antagonistic interactions among them, which can be experimentally investigated.

3

Recalibrating Mendelian randomization under winner's curse, sample structure and polygenicity

Yang, Y.; Lin, Z.; Xue, H.; Zhu, X.

2026-07-07 genetic and genomic medicine 10.64898/2026.06.25.26356593 medRxiv

Top 0.1%

6.6%

Recently, Hu et al. (2024) conducted a benchmarking study showing that most existing Mendelian randomization (MR) methods exhibit substantial bias and inflated type-I error rates in real data. They attributed these failures to two largely neglected sources of bias: winner's curse and polygenicity-induced bias. Although a few methods have been developed to address one or both of these issues, existing approaches either do not fully account for both biases or are restricted to the univariable setting. In this paper, we propose a multivariable Rao-Blackwellization that corrects winner's curse while accounting for polygenicity and sample structure in a unified framework. Unlike univariable Rao-Blackwellization, where instrument selection yields a truncated normal statistic amenable to a Mills-ratio correction, multivariable Rao-Blackwellization conditions on a noncentral $\chi^2$ statistic, for which no analogous correction is available. We derive closed-form conditional moments under this instrument selection model and use them to construct bias-corrected summary statistics that can be integrated into a wide range of existing MR methods. Simulations and real data analyses show that, when combined with methods such as MR-cML and MR-BEE, the proposed correction substantially improves type-I error control and yields more robust inference.

4

Unlocking Multi-Sample Differential Expression for Spatial Transcriptomics Data with TESSERA

Constantine, F.; Laszik, Z.; Dudoit, S.; Purdom, E.

2026-04-30 bioinformatics 10.64898/2026.04.27.720955 medRxiv

Top 0.1%

6.6%

Spatial transcriptomics allows the unprecedented examination of gene expression levels at the resolution of spatially-situated single cells in a high-throughput manner. As the technology is adopted more broadly, studies frequently collect data from multiple tissue samples, which leads to unique challenges that traditional spatial statistical methods are not equipped to handle. In particular, factors that differ across samples, such as different coordinate systems, different numbers and types of cells, different underlying tissue architectures, among others, preclude the application of traditional methods to our new setting. In this work, we propose a novel method, TESSERA, based on a spatial generalized linear model, for analyzing multi-sample spatial transcriptomics count data. Importantly, we provide a mathematical and computational framework for efficient and scalable model fitting and statistical inference to accompany the specification of our model. Our method for fitting the model enables the estimation of a common set of fixed effects across samples. This allows us to address a variety of differential expression questions, such as identification of which genes are differentially expressed between conditions (e.g., diseases, treatments), while accounting for spatial correlation between cells within a sample. We benchmark our proposed method on simulated data and apply it to a spatial transcriptomics dataset of human kidney samples. We find that our method provides a hitherto nonexistent extension to the multi-sample setting while remaining competitive with or outperforming existing algorithms in the single-sample setting.

5

Multi-resolution Spatial Graphical Regression Models for Hierarchical Spatial Transcriptomics Data

Chen, L.; Acharyya, S.; May, A. M.; Udager, A. M.; Keller, E. T.; Baladandayuthapani, V.

2026-05-15 genomics 10.64898/2026.05.12.724724 medRxiv

Top 0.1%

6.2%

Advances in spatial transcriptomics (ST) technologies enable systematic molecular characterization of tumor microenvironment, tumor gradients and gene regulatory networks. Cancer progression is known to vary along pathological gradients, yet existing network approaches for gene network inference typically ignore hierarchical spatial organization across the tumor. We develop a Bayesian multi-resolution spatial graphical regression (mSGR) framework to infer spatially varying gene networks from multi-resolution ST data. The proposed model allows precision matrices to vary across hierarchically structured spatial domains, capturing both local and global organization within the tumor. To identify spatially varying regulatory relationships, we introduce a spatially structured edge selection strategy that borrows strength across regions according to spatial proximity and pathological gradients, while Gaussian-process priors flexibly model spatial variation in edge strengths. Scalable inference is achieved through an augmented mean-field variational Bayes algorithm with node-wise parallel regressions, enabling efficient estimation in high-dimensional settings. Simulation studies demonstrate improved recovery of network structures compared with competing approaches. Applying mSGR to multi-resolution ST data from kidney cancer reveals stronger regulatory connectivity in transitional regions of epithelial-mesenchymal transition pathway and identifies hub genes along the tumor gradient, illustrating how spatially resolved network analysis can provide key insights into tumor microenvironment organization.

6

Correcting spatial transcriptomics data affected by a prevalent transcript leakage problem across platforms, species, and tissues

Shi, C. H.; Zhai, Y.; Chow, S. H.-C.; Li, L.; Carver, C. M.; Teneche, M. G.; Flores, J.; Kern, C.; Adams, P. D.; Ren, B.; Schafer, M. J.; Zhu, Q.; Wei, Y.; Yip, K. Y.

2026-06-17 bioinformatics 10.64898/2026.06.13.732076 medRxiv

Top 0.1%

6.1%

Spatial transcriptomics has been widely applied to study the spatial distribution of cell types, cell states, and specific gene expression in tissue samples. However, we show that there is a prevalent transcript leakage problem in spatial transcriptomics data, where transcripts expressed by a cell diffuse to its neighborhood and are recurrently detected in the nearby cells. By analyzing published data sets, we show that this problem is general across data produced from different tissues and different species using different imaging-based and sequencing-based spatial transcriptomics platforms. It affects both upstream tasks such as expression quantification as well as downstream tasks such as cell-type annotation and detection of spatially-dependent gene expression. To tackle the transcript leakage problem, we propose a reference-free Bayesian model-based method, DeLeakage, which cleans up the data much more effectively than existing denoising methods. DeLeakage also improves cell-type annotation and avoids false detection of spatially dependent expression.

7

Model-based inference of gene expression noise from single-cell RNA-sequencing data

Giersdorf, F.; Rogers, D. W.; Christensen, S.; Dutheil, J. Y.

2026-06-23 bioinformatics 10.64898/2026.06.18.733122 medRxiv

Top 0.1%

4.4%

The heterogeneity of expression levels among genetically identical cells, termed gene expression noise, is a property of the gene expression process whose importance in the biology of organisms and their evolution is increasingly recognized. Measuring gene expression noise requires single-cell expression data, as obtained from single-cell RNA sequencing (scRNASeq). Its estimation, however, is challenging owing to (i) the presence of technical noise in addition to biological noise, and (ii) the heterogeneity of cell types in the sampled population. We propose a maximum-likelihood framework to infer biological noise from scRNASeq data, while accounting for technical noise, dropout probabilities, and distinct cell sequencing depths. Using simulations, we demonstrate parameter identifiability and show that the resulting noise estimates are uncorrelated with mean gene expression and therefore need no extra correction in downstream analyses, easing intra- and inter-genome comparisons. Using two technical replicates of scRNASeq data from the wild yeast Saccharomyces paradoxus, we show that expression noise can be inferred in a reproducible manner.

8

Synthetic Data Generation and Nonparametric Techniques for Assessing Multivariate Similarity to Address Small-Sample Size Challenges

Heine, J.; Fowler, E.; Eschrich, S. A.; Schell, M.

2026-05-07 bioinformatics 10.64898/2026.05.04.722226 medRxiv

Top 0.1%

4.3%

Data modeling in biomedical research often operates in the small-sample regime, where the number of observations is small relative to the data dimensionality; the detrimental effects of limited sample sizes are well documented in cancer studies. Synthetic data offers a potential solution to data shortfalls provided that the data generated is an adequate facsimile of the underlying distribution; the adequacy of such synthetic data remains an open-ended problem. In this work, we evaluate a synthetic generator proposed previously. The generator applies a series of transformations to the observed data to accommodate the small-sample size resulting in an uncoupled representation, where uncorrelated marginal distributions are modeled with optimized univariate kernel density estimation. In this report, (1) we develop a nonparametric method for assessing multivariate similarity based on the Cramer-Wold theorem and random projection testing, (2) investigate when the absence of bivariate correlation approximates independence in a non-normal setting, and (3) evaluate artifacts induced by data compression. The presentation is primarily methodological; low-dimensional data were used so each stage of the generation process could be analyzed explicitly. A formal testing framework was developed by comparing random projection level outcomes with a two-sample test, modeling these outcomes as Bernoulli trials, aggregating replicate outcomes within each projection direction, and pooling outcomes across many directions, yielding a scalable standardized normal test-statistic. The key innovation was decoupling the two-sample test significance level from that governing finalized normal inference. We showed the same projection framework also evaluates the full multivariate covariance structure. 
The generator produced high-fidelity multivariate synthetic data when the bivariate correlation approximates independence in the non-normal setting; in highly compressed data, residual modes were best modeled as normally distributed regardless of their intrinsic distributional form. Ongoing work includes applying these methods to higher-dimensional, diverse data.

9

A Beta-Binomial Model for Estimating Zero- or One-inflated Pain Trajectories

Liu, Y.; Harris, R. E.; Clauw, D.; Bayman, E.; Leroux, A.; Lindquist, M. A.

2026-05-11 bioinformatics 10.64898/2026.05.07.721507 medRxiv

Top 0.1%

4.2%

Chronic pain is a widespread public health issue that imposes substantial health, emotional, and economic burdens on individuals and communities. Because pain is subjective and lacks objective biomarkers, it is typically measured using patient-reported scores, often on a numerical scale from zero to ten. Increasingly, pain studies use ecological momentary assessment, with multiple daily assessments over days and across study phases (e.g., a series of baseline and post-intervention assessments). These data frequently show many ratings at the extremes (i.e., at minimum or maximum pain scores), commonly referred to as zero- and one-inflation in the statistical literature, along with considerable within-person variability both within and across days. These phenomena present challenges for statistical analyses, as they violate assumptions of most commonly used statistical techniques (e.g., the normality assumption of linear mixed models). We propose a Bayesian beta-binomial mixed-effects model for modeling potential zero- or one-inflated pain scores while accounting for variability using random effects on the mean and variance parameters across subjects. A simulation study demonstrates that the method accurately estimates model parameters across realistic sample sizes, time points, and zero- and one-inflation levels. An application to data from two longitudinal pain studies demonstrates that the model fits the data better and, when correctly specified, yields accurate uncertainty intervals for longitudinal changes in pain compared to existing models, especially for zero- and one-inflated outcomes. Additionally, the model directly estimates the probability of clinically meaningful pain events. The proposed method provides a powerful statistical framework for studying the patient-reported pain trajectories.
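As a rough illustration of the kind of likelihood this abstract describes, the sketch below evaluates a beta-binomial probability mass function for a 0-10 score and mixes in extra point mass at the minimum and maximum. The mean/precision parameterization (`mu`, `phi`), the inflation weights (`pi_zero`, `pi_max`), and all numeric values are assumptions made for illustration; the paper's actual model, including its random effects and priors, is not specified in the abstract and is not reproduced here.

```python
import math


def log_beta(a, b):
    """Log of the Beta function, via log-gamma for numerical stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(y, n, mu, phi):
    """P(Y = y) for a beta-binomial on 0..n with mean n * mu.

    mu in (0, 1) is the mean proportion and phi > 0 a precision
    (larger phi means less overdispersion). Hypothetical parameterization.
    """
    a, b = mu * phi, (1.0 - mu) * phi
    log_choose = math.lgamma(n + 1) - math.lgamma(y + 1) - math.lgamma(n - y + 1)
    return math.exp(log_choose + log_beta(y + a, n - y + b) - log_beta(a, b))


def inflated_pmf(y, n, mu, phi, pi_zero=0.0, pi_max=0.0):
    """Mixture: extra mass pi_zero at 0 and pi_max at n, rest beta-binomial."""
    prob = (1.0 - pi_zero - pi_max) * beta_binomial_pmf(y, n, mu, phi)
    if y == 0:
        prob += pi_zero
    if y == n:
        prob += pi_max
    return prob


# Illustrative values only: a 0-10 pain scale with 15% extra zeros.
probs = [inflated_pmf(y, 10, mu=0.4, phi=5.0, pi_zero=0.15) for y in range(11)]
```

The list `probs` holds one probability per score from 0 to 10 and sums to one, so a quantity such as the probability of a score at or above some clinical threshold can be read off by summing the relevant entries.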

10

inGSEA: An Improved Method for Gene Set Enrichment Analysis Using a Weighted Integral Statistic

Zhang, Q.; Li, Q.

2026-06-05 bioinformatics 10.64898/2026.06.02.729106 medRxiv

Top 0.1%

4.0%

Gene Set Enrichment Analysis (GSEA) is one of the most popular methods for transcriptomic analysis, yet its statistical power is limited when the biological pathways exhibit heterogeneous or non-concordant expression patterns. We propose an improved GSEA method, integral-based GSEA (inGSEA). inGSEA introduces a novel enrichment score based on the Anderson-Darling weighted integral statistic. The new enrichment score enhances detection power for complex signals, particularly sparse and bidirectional ones, while the Cauchy combination of integral and classic maximum statistics provides robustness across diverse expression patterns. Extensive numerical studies demonstrate that inGSEA achieves superior power and well-calibrated false discoveries. Application to real-world datasets reveals biologically relevant pathways missed by the standard GSEA. inGSEA reduces the computational burden of permutation testing by employing a generalized gamma distribution to approximate the null distribution. inGSEA is accessible as a user-friendly web-based software tool (https://amss-stat.github.io/inGSEA).
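The Cauchy combination mentioned in this abstract is a general way to merge several p-values into one. The sketch below shows the standard form of that rule, assuming equal weights by default; the function name and the two example p-values are hypothetical, and how inGSEA weights or computes its integral and maximum statistics is not stated in the abstract.

```python
import math


def cauchy_combination(p_values, weights=None):
    """Combine p-values with the Cauchy combination rule.

    Weights must be non-negative and sum to one; equal weights are
    used when none are given.
    """
    k = len(p_values)
    if weights is None:
        weights = [1.0 / k] * k
    # Under the null, tan((0.5 - p) * pi) is a standard Cauchy variate.
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, p_values))
    # Upper-tail probability of a standard Cauchy distribution at t.
    return 0.5 - math.atan(t) / math.pi


# Hypothetical p-values for one gene set from two different statistics.
p_integral = 0.004
p_maximum = 0.20
p_combined = cauchy_combination([p_integral, p_maximum])
```

The combined value is dominated by the smaller p-value, which is why this rule is often used to keep power when only one of several statistics picks up the signal.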

11

Robust Inference of Individualized Treatment Effect in Mendelian Randomization

Liang, M.; Wu, R.; Xiao, F.; Li, X.

2026-05-12 genetics 10.64898/2026.05.08.723855 medRxiv

Top 0.1%

4.0%

Mendelian randomization (MR) is widely used to draw causal conclusions in the presence of unmeasured confounding, but most MR analyses focus on average treatment effects and rely on strong assumptions. For precision medicine, the primary target is instead the individualized treatment effect (ITE); yet in MR, such effects are not point-identified under core IV assumptions, and valid inference is particularly challenging. We therefore propose a robust partial identification inference framework for ITE under MR allowing multiple instruments. Under minimal causal assumptions, we derive a sharp inference procedure for the intersection bounds of ITE by adopting a multiplier bootstrap procedure with data-adaptive bootstrap distribution shifting and heterogeneous variance adjustment. In theory, we prove that the proposed method achieves nominal coverage and asymptotic sharpness. Further, we extend the procedure to tolerate possible invalid IVs under a minimal proportion rule assumption by aggregating over instrument subsets while preserving coverage. Simulation studies demonstrate that the proposed methods attain nominal coverage and substantially shorter intervals than existing procedures. We illustrate the framework using data from the Alzheimer's Disease Neuroimaging Initiative to assess heterogeneous causal effects of TREM2 expression on Alzheimer's disease risk across education-defined subgroups.

12

Infectious Disease Forecasting via Physics-Informed Machine Learning

Hart, J. C.; Smith, H.; McMahan, C.; Rennert, L.

2026-06-16 bioinformatics 10.64898/2026.06.12.731957 medRxiv

Top 0.1%

3.3%

Infectious disease transmission evolves as a dynamic process shaped by biological mechanisms, population behavior, and intervention policies, yet public health responses are often driven by lagging indicators. Accurate short- and long-term disease forecasting is essential for the timely deployment of intervention strategies, healthcare capacity planning, and uncertainty-aware, risk-informed decision-making. To address this challenge, three broad classes of forecasting models have traditionally been used: statistical, machine learning, and mechanistic approaches. However, each of these modeling paradigms faces fundamental limitations. In particular, traditional statistical models often lack the flexibility needed to capture complex disease dynamics, machine learning approaches require large, high-quality data streams, and mechanistic models are notoriously difficult to calibrate. To overcome these challenges, we propose a novel physics-informed machine learning (PIML) framework for forecasting infectious disease dynamics. Our approach simultaneously forecasts new case and hospitalization counts, along with other key epidemiological quantities such as the time-varying reproduction number. This is achieved through the design of a machine learning model and estimation strategy regularized by a system of differential equations that encode disease dynamics of the SIHR model, thereby bridging the gap between purely data-driven and mechanistic models. We demonstrate the proposed methodology through in-depth numerical studies and an application to COVID-19 data collected in the state of South Carolina.

13

A Hierarchical Bayesian Agent Based Model for Binary Spatio-Temporal Spread: Theory, PDE Scaling Limit, and an Application to Predator Prey Cycles

Pan, X.

2026-06-08 ecology 10.64898/2026.06.03.729943 medRxiv

Top 0.1%

3.3%

We describe a statistical agent-based model (SABM) for binary spatio-temporal data in which the occupancy of each cell evolves as a Bernoulli mixture of three mechanistically distinct processes: local persistence, anisotropic neighborhood dispersal, and long-distance dispersal. The model is embedded in a hierarchical Bayesian framework with conjugate Beta full-conditionals for the persistence and long-distance parameters and a Dirichlet prior on the directional dispersal kernel. A nonstationary extension links the dispersal kernel to a latent habitat-suitability surface through directional gradients of a Gaussian process. We show that, in the small-step regime, the Lagrangian recurrence for the dispersal kernel scales to a classical two-dimensional advection-diffusion partial differential equation whose drift and dispersion coefficients are the first and second moments of the dispersal probabilities. We provide an MCMC algorithm exploiting the exact full-conditionals and demonstrate parameter recovery and PDE-scaling agreement in a simulated example.

14

Calibrating machine learning approaches for probability estimation without calibration data

Di Carluccio, E.; Koliopanos, G.; Ojeda, F. M.; Weimar, C.; Ziegler, A.

2026-07-13 epidemiology 10.64898/2026.07.10.26357723 medRxiv

Top 0.1%

3.1%

Statistical prediction models for binary outcomes are becoming increasingly popular. One significant challenge is calibrating these models to suit the characteristics of a target population that is structurally different from the original population. Calibration is especially challenging when there is no training data available from the target population. To address this problem, we propose a novel calibration method, SimCal, which uses synthetic data generated from the model development data in conjunction with marginal statistics from the calibration cohort. We show that expert judgment modeling (EJM) may be used for calibration if cross-sectional data from the target population are available comprising expert judgments about the potential outcome and the covariates. We describe three alternative calibration approaches when calibration data are lacking: similarity-binning averaging (SBA), adaptive calibration of predictions (ACP), and Elkan calibration. In a simulation study, we compare SBA, ACP, Elkan calibration, and SimCal. R code for applying these methods is provided from the re-analysis of data on coronary artery disease. We illustrate all 5 calibration approaches with a real data set for predicting functional outcome after stroke and all approaches but EJM in the re-analysis of the Cleveland Clinic data. None of the approaches performed convincingly well in all situations. SimCal performed well when model parameters were correctly specified. EJM failed on the stroke data. Further research is urgently required for calibration in the absence of calibration data.

15

Extracting Parsimonious Quantitative Predictors of Biological Effectiveness from 'First-Principles' Radiobiology: Application to the Mixed-Quality Problem

Yusufaly, T.; Transtrum, M.; Huang, L.; Sabok-Sayr, S.; Sgouros, G.; Hobbs, R.; Jia, X.

2026-05-06 biophysics 10.64898/2026.05.02.722446 medRxiv

Top 0.1%

2.5%

Developing parsimonious, mechanism-aware quantitative models that predict how biological effectiveness changes with different modifiers remains, in general, an unsolved problem. Advances in radiobiological research have created a large knowledge base of first-principles mechanistic models of radiation response that, in principle, could accurately predict radiosensitivity across different experimental and clinical conditions. However, in practice these mechanistic models come with an overabundance of parameters, the majority of which are practically unidentifiable and, moreover, likely unnecessary if one simply wishes to predict how radiosensitivity changes for some specific modifier of interest. Nevertheless, determining which few details in the full mechanistic model are relevant for a given purpose, as well as how to remove any other extraneous details, remains a highly non-trivial task. In this study, we demonstrate the potential of model reduction, starting from a detailed mechanistic description, as a systematic strategy for deriving parsimonious, experimentally falsifiable radiobiological descriptors. As a proof-of-concept demonstration, we apply the Manifold Boundary Approximation Method (MBAM) to a Mechanistic Model of DNA Repair and Survival (MEDRAS), for the problem of cell survival prediction following an acute exposure. Our findings reveal that the complete MEDRAS model for an arbitrary mixed-quality exposure can be structurally simplified to a reduced three-parameter model for an effective uniform-quality, named MEDRAS-LPL. Additional MBAM analysis on MEDRAS-LPL identifies two boundaries in parameter space, corresponding to sparsely ionizing and densely ionizing radiation. 
Mapping of MEDRAS-LPL parameter space onto effective LQ space further demonstrates that parameters close to the sparsely ionizing boundary line up with expectations from the theory of dual radiation, while parameters close to the densely ionizing boundary line up with expectations from a purely linear model based on a target-theory description. Moreover, our formalism predicts enhanced synergistic interactions between sparsely ionizing and densely ionizing radiation beyond the Zaider-Rossi model (ZRM) paradigm, in line with empirical observations. The results highlight the potential for using reduced-order models not only for predictive applications but also for generating novel hypotheses that can inform future experimental designs and optimization strategies in radiobiology.

16

Quantum kernel support vector machines for trabecular bone classification: comparing feature reduction strategies on synthetic micro-CT data

Florez, I.; Farhat, A.; Le Houx, J.; Altamura, E.; Tozzi, G.

2026-05-07 biophysics 10.64898/2026.05.04.722627 medRxiv

Top 0.1%

2.2%

Quantum kernel methods offer a potential advantage for classification tasks in high-dimensional feature spaces, yet their practical benefit critically depends on how input features are prepared. We compare five dimensionality reduction strategies--principal component analysis (PCA), Gaussian random projection (RP Gaussian), sparse random projection (RP Sparse), partial least squares (PLS), and uniform manifold approximation and projection (UMAP) -- as pre-processing steps for quantum kernel support vector machines (SVMs) applied to trabecular bone classification from synthetic micro-computed tomography (micro-CT) data. Using a custom procedural generator based on Gaussian random field zero-crossings, we produced 500 synthetic trabecular bone volumes with controlled morphometric properties such as bone volume fraction (BV/TV), trabecular thickness (Tb.Th), number (Tb.N) and spacing (Tb.Sp). Texture features extracted from grayscale slices are reduced to 8-dimensional quantum circuit inputs via each method, then classified using both classical radial basis function (RBF)-SVMs and quantum kernel SVMs with ZZ feature maps on a statevector simulator, both evaluated with 5 x 5 repeated stratified cross-validation (25 folds). Our results show that UMAP is the only reduction method where the quantum kernel remains competitive with the classical baseline. Under repeated cross-validation, UMAP showed a +0.032 accuracy gap favouring the quantum kernel (Dietterich 5 x 2 CV p = 0.177); however, validation on 10 fully independent datasets--each with independently generated samples, separate reduction fits, and separate kernel matrices -- reversed the sign to -0.030 (paired t-test p = 0.123; Wilcoxon p = 0.193; quantum wins 3/10 datasets), indicating that the apparent advantage was likely inflated by fold dependence. 
Nevertheless, UMAP's gap remains small and non-significant in both analyses, whereas all linear methods (PCA, RP Gaussian, PLS) show substantial quantum deficits of -0.090 to -0.116 across BV/TV classification, with PCA and PLS remaining significant under corrected tests (5 x 2 CV p = 0.004 and p = 0.007 respectively). We additionally evaluate quantum kernel ridge regression for continuous morphometric prediction, finding that ZZ quantum kernels fail uniformly at regression (negative $R^2$ for all methods except PLS at 4 qubits), suggesting that the ZZ kernel captures decision boundaries but not smooth metric structure. These findings provide practical guidance for feature engineering in near-term quantum machine learning pipelines and demonstrate that the choice of dimensionality reduction can determine whether quantum kernels remain competitive with classical baselines.

17

Uncertainty-aware localization microscopy by variational diffusion

Seitz, C.; Liu, J.

2026-05-05 bioinformatics 10.64898/2026.05.01.722206 medRxiv

Top 0.1%

2.2%

Fast extraction of physically relevant information from images using deep neural networks has led to significant advances in fluorescence microscopy and its application to the study of biological systems. For example, the application of deep networks for kernel density (KD) estimation in single-molecule localization microscopy (SMLM) has accelerated super-resolution imaging of densely labeled structures in the cell. However, localization of fluorescent molecules in dense images is a difficult inverse problem with potentially multiple solutions. To model a probability distribution of solutions to this problem, we propose a generative modeling framework for KD estimation in SMLM based on a conditional variational diffusion model (CVDM). In this framework, CVDM is trained to perform localization tasks on low-resolution measurements by modeling a distribution of high-resolution KD estimates. This approach allows us to probe the structure of the distribution on KD estimates and express uncertainty, which is not currently offered by existing deep models for localization microscopy. We demonstrate that this model permits high-fidelity super-resolution, enables the uncertainty estimation of regressed KD estimates, and has important implications for image restoration in single-molecule and super resolution microscopy.

18

Overinflation and overconcentration: why Cauchy perturbation kernels are the right choice for ABC-SMC

Sturrock, M.; Shahrezaei, V.

2026-07-09 systems biology 10.64898/2026.06.24.734205 medRxiv

Top 0.1%

2.1%

Approximate Bayesian computation sequential Monte Carlo (ABC-SMC) propagates its particles with a perturbation kernel, and with the standard Normal kernel it degrades sharply as the parameter dimension grows, a failure usually attributed to dimension itself. We show instead that it is governed by the quality of the summary statistics, with dimension entering only through a separate and milder mechanism, and that the two must act together for the Normal kernel to break. The first ingredient is covariance overinflation: the kernel covariance, estimated from the particle cloud, overshoots the true posterior covariance by a factor set by information loss in the summary statistics. We derive this overscaling factor in closed form for a Gaussian model with sufficient statistics and show that it stays modest at any dimension, shrinking toward its baseline value as the tolerance tightens; the extreme values seen in practice (of order $10^3$) are a signature of insufficient summaries, not of dimension. The second ingredient is perturbation overconcentration: the normalised Normal step size concentrates around one as the dimension grows, so every proposal overshoots by the same factor. Either ingredient alone is harmless; only their combination breaks the Normal kernel. A Cauchy kernel (multivariate t with one degree of freedom) removes the concentration, keeping a positive acceptance rate under arbitrary overscaling at a bounded worst-case cost of 1.87x in expected squared jump distance. In a Metropolis-Hastings framework we derive closed-form acceptance rates for both kernels that illustrate the advantage of the Cauchy kernel in this limit. A series of full ABC-SMC computational experiments on five problems at d = 12, including a hierarchical gene-expression model, show the Cauchy kernel reducing the sliced Wasserstein distance to the reference posterior by factors of up to 50 with the same simulation budget.
Since the summary statistics are commonly insufficient for the models that require ABC, overinflation is structural and the Cauchy perturbation kernel is the right default for problems in higher dimensions.
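The "overconcentration" ingredient in this abstract can be seen in a small simulation. The sketch below draws perturbation steps from a Normal kernel and from a multivariate t kernel with one degree of freedom (the Cauchy kernel), then measures how tightly the normalised step length clusters. It assumes an identity scale matrix and uses the interquartile range as the spread measure; both are choices made here for illustration, and the code does not implement ABC-SMC or the paper's experiments.

```python
import math
import random


def normal_step(d, rng):
    """One d-dimensional step from a standard Normal kernel."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]


def cauchy_step(d, rng):
    """One step from a multivariate t kernel with one degree of freedom.

    Built as a Normal draw divided by the square root of a chi-square(1)
    variate, i.e. by |g| for g ~ N(0, 1).
    """
    g = 0.0
    while g == 0.0:  # guard against an exact zero draw
        g = rng.gauss(0.0, 1.0)
    return [z / abs(g) for z in normal_step(d, rng)]


def normalised_length(step):
    """Euclidean length divided by sqrt(d)."""
    return math.sqrt(sum(z * z for z in step) / len(step))


def spread(sampler, d, n_draws=2000, seed=0):
    """Interquartile range of the normalised step length."""
    rng = random.Random(seed)
    lengths = sorted(normalised_length(sampler(d, rng)) for _ in range(n_draws))
    return lengths[3 * n_draws // 4] - lengths[n_draws // 4]


for d in (2, 12, 50):
    print(d, round(spread(normal_step, d), 3), round(spread(cauchy_step, d), 3))
```

The Normal column shrinks as d grows, meaning nearly every proposal has the same length, while the Cauchy column stays wide, so some proposals remain short even when the kernel scale is badly overestimated.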

19

Bayesian Inference of Bond Parameters from Interactions between Single Filaments

Pajanonot, K. A. T.; Lambert, S.; Kumari, P.; Koester, S.; Klumpp, S.

2026-06-02 biophysics 10.64898/2026.05.29.728470 medRxiv

Top 0.1%

2.1%

We map interaction forces between two vimentin filaments (cytoskeletal components crucial for cell mechanics) using optical tweezers, while controlling the relative velocity. We introduce a powerful Bayesian inference framework to learn bond parameters directly from force trajectories. The information gained about the bond parameters is maximized by an optimal relative velocity and further by distributing measurements across multiple velocities. Our Bayesian framework is broadly applicable to a large range of biomolecular interactions and force spectroscopy techniques.

20

Differential Expression Analysis for Longitudinal Single-Cell RNA-Sequencing Studies Using REBEL

Wynn, E. A.; Mould, K. J.; Vestal, B. E.; Moore, C. M.

2026-05-11 genomics 10.64898/2026.05.06.723139 medRxiv

Top 0.1%

1.9%

Longitudinal scRNA-seq experiments offer a powerful approach for dissecting temporal gene expression dynamics in individual cell types. However, few methods have been developed specifically to address the unique statistical challenges of repeated measures in scRNA-seq data. Here, we introduce a novel method, REBEL (Repeated measures Empirical Bayes differential Expression analysis using Linear mixed models), for analyzing cell type-specific differential expression in repeated measures scRNA-seq experiments. Using simulation studies, we demonstrate that, relative to conventional repeated measures analysis methods and other scRNA-seq approaches, REBEL controls the false discovery rate and exhibits competitive power across a range of simulation scenarios. We further validate REBEL by analyzing a longitudinal scRNA-seq dataset from patients with B-cell lymphoma receiving chimeric antigen receptor (CAR)-T cell therapy. REBEL is implemented as an R package, available at https://github.com/ewynn610/REBEL.